Big Data Analysis in Finance

Module 2

Author
Affiliation

Prof. Matthew G. Son

University of South Florida

Before we begin

Preparation

Make sure to:

  1. Download the data from the URL below
  2. Install the required packages
# 1. Install packages
install.packages(c("arrow", "curl", "vroom", "fs", "bench", "lobstr"))

# 2. Download the large data file
large_data_url <- "https://usfedu-my.sharepoint.com/:u:/g/personal/gson_usf_edu/EUcFVZTsXA9DkxNCDa2SsZsBQQAjCmdv52GFDpVuRL7v-Q?download=1"
curl::multi_download(large_data_url, "option_prices.tsv", resume = TRUE)

The Arrow project

Apache Arrow

  • A cross-language, in-memory data format specification

  • A standardized way to represent data in memory

  • Designed to accelerate data processing and interoperability

  • Focused on efficient data transfer and sharing

Arrow

Accelerated Data Interchange

Apache Arrow & R

  • R is a powerful language for data analysis

  • Apache Arrow addresses:

    • Efficient (in-memory) data storage
    • Fast data exchange between R and other languages
    • Machine Learning Pipelines
    • Cross-language Data Collaboration

Feather Project

Arrow:

  • A unified, in-memory columnar data format
  • Supports zero-copy sharing between processes
  • Ideal for cross-language data processing (e.g., R, Python, C++)

Feather:

  • Feather is built on Arrow, using its efficient data representation for disk storage
  • Simplicity makes it perfect for smaller datasets or interim storage

Parquet:

  • Designed for large-scale, analytical workloads
  • Optimal for columnar storage with efficient compression
  • Commonly used in big data ecosystems (e.g., Spark, Hadoop)

The Parquet Project

Intro to Parquet

Parquet is a disk-based, columnar storage format that adheres to the principles of Apache Arrow

While Feather emphasizes speed for quick data exchange, Parquet is designed for deep analytics on massive datasets.

Tabular data structure

Columnar vs. Row Storage

Columnar Storage

  • Data stored by column, not by row

  • Each column’s data is stored together

Row Storage

  • Data stored by row, with all columns together

  • Traditional relational databases often use row storage

Memory Buffer Structure

Columnar Storage

  • Optimized for analytics and querying

  • Excellent compression and encoding capabilities

  • Efficient for analytic querying

Row Storage

  • Suitable for transactional systems

  • Efficient for read/write operations by row

  • Less efficient for analytic queries
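To see the analytic benefit of columnar storage, the sketch below (using the built-in mtcars data as a small stand-in for a large table) reads only two columns from a Parquet file; a row-oriented CSV reader would have to scan every full row instead:

library(arrow)

# Write a small table to Parquet (columnar on disk)
write_parquet(mtcars, "mtcars.parquet")

# Because Parquet is columnar, a reader can pull just the columns it needs
two_cols <- read_parquet("mtcars.parquet", col_select = c("mpg", "cyl"))
dim(two_cols) # 32 rows, 2 columns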

Workflow

The arrow package for R provides a low-level interface to the Arrow C++ library and also offers a dplyr backend:

Lazy vs Eager Evaluations

Lazy evaluations

Lazy evaluation postpones the computation of an expression until its result is explicitly requested

  • Saves memory by delaying computations

  • Useful for large data operations

  • Explicit call:

    • dplyr::collect() triggers computation (works with both the arrow and dbplyr backends)

Eager evaluations

  • Eager evaluation computes the result immediately when the expression is given

    • Faster when the result is needed right away

    • Real-time data analysis

    • Interactive programming

Arrow lazy evaluations

arrow::open_dataset() reads the dataset “lazily”

  • It creates a link to the files on disk

  • It does NOT read the data until explicitly told to

  • collect() must be called to actually read the data

cf. arrow::read_csv_arrow() or readr::read_csv() reads the file right away (eager)
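A minimal sketch of the contrast, again using mtcars as a stand-in dataset:

library(arrow)
library(dplyr)

write_csv_arrow(mtcars, "mtcars.csv")

# Eager: the whole file is parsed into memory immediately
eager <- read_csv_arrow("mtcars.csv")

# Lazy: open_csv_dataset() only records where the data lives
lazy <- open_csv_dataset("mtcars.csv")
result <- lazy |>
  filter(cyl == 6) |>
  collect() # reading happens here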

Benchmarking

What is Code Benchmarking?

There are many different ways to achieve the same goal.

Code benchmarking is the process of:

  • Measuring and analyzing the performance of your R code

  • Identifying the most efficient approach among many alternatives

Why does it matter?

Because the data is big.

A difference between an operation taking 3 seconds and one taking 1 second is probably not significant.

However, because run time usually scales with the size of the data:

  • It matters when it comes to 3 hours vs 1 hour
  • or even 3 days vs 1 day

Code Benchmarking

The process of measuring the performance of your code

  • assess execution time

  • resource usage

We will use the bench package. Make sure it is installed:

install.packages("bench")
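As a warm-up, here is a minimal benchmark comparing two ways to compute the same square roots; bench::mark() runs each expression repeatedly and reports timing and memory allocation:

library(bench)

x <- runif(1e5)

# By default, bench::mark() also verifies the expressions return equal results
results <- bench::mark(
  sqrt(x),
  x^0.5
)
results[, c("expression", "median", "mem_alloc")]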

Dummy example

Generate a dummy large data file by repeating and row-binding.

library(dplyr) # for bind_rows()
library(nycflights13)
data(flights)
flights <- bind_rows(replicate(3, flights, simplify = FALSE)) # stack 3 copies
flights |> dim()

Then save to .csv file using:

  • utils::write.csv() function
  • readr::write_csv() function
  • arrow::write_csv_arrow() function
  • vroom::vroom_write() function

Dummy example: Writing speed

bench::mark(
  utils::write.csv(flights, "flights.csv"), # base R writer
  readr::write_csv(flights, "flights.csv"),
  arrow::write_csv_arrow(flights, "flights.csv"),
  vroom::vroom_write(flights, "flights.csv", delim = ","),
  check = FALSE
)

Efficient storage

gz compressions

Compression can reduce size of the data. GZ csv file compression with vroom:

bench::mark(
  vroom::vroom_write(flights, "flights.csv.gz", delim = ",")
)

parquet

An Arrow storage solution. Parquet offers the best of both worlds:

  • Faster writing and reading
  • Faster processing with Arrow memory representation
  • Smaller file size with snappy compression (by default)

arrow::write_parquet(flights, "flights.parquet")

File size comparison:

fs::file_size("flights.csv")
fs::file_size("flights.csv.gz")
fs::file_size("flights.parquet")

Reading speed benchmarks:

bench::mark(
  utils::read.csv("flights.csv"),
  readr::read_csv("flights.csv"),
  data.table::fread("flights.csv"),
  vroom::vroom("flights.csv", delim = ","),
  arrow::read_csv_arrow("flights.csv"),
  arrow::read_parquet("flights.parquet"),
  check = FALSE
)

Verdict

Use Parquet when your data is large: it is a lighter, faster storage solution.

In-Class Exercise

Exercise

Let’s work on real financial data:

  • All U.S. equity options

    • Each stock has many option series

  • In TSV format, 1 GB+ (a random sample)

  • End-of-day records

fs::file_size("option_prices.tsv")

Read lazily

arrow::open_tsv_dataset() opens the data lazily, like setting up a database connection.

library(arrow)
library(tidyverse)
option_data <- open_tsv_dataset("option_prices.tsv") # lazy reading
option_data |> nrow() # count rows without reading the data into memory

Caution

arrow package provides two types of readers: eager and lazy.

read_csv_arrow() is the eager reader, while open_csv_dataset() is the lazy reader.

Lazy evaluation

Let’s check the memory footprint of the data:

library(lobstr)
# Not taking much memory
obj_size(option_data)

Out-of-core processing

Even when the data doesn’t fit in memory, you can still perform analysis with arrow.

Perform below operation:

january_data <- option_data |>
  select(
    optionid,
    date,
    symbol,
    exdate,
    cp_flag,
    strike_price,
    volume,
    best_bid,
    best_offer,
    impl_volatility
  ) |>
  filter(!is.na(impl_volatility), month(date) == 1)
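Note that january_data above is still a lazy query: no rows have been read yet. A sketch of materializing it (assuming the option_prices.tsv pipeline above has been run):

library(dplyr)

# collect() executes the query out-of-core and pulls only the
# filtered January rows into memory
january_df <- january_data |> collect()
nrow(january_df)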

Write subsample

Store them in a separate file: csv

january_data |>
  arrow::write_csv_arrow("january_data.csv")

Or store them in another file: parquet

january_data |>
  arrow::write_parquet("january_data.parquet")

Report the size of the files with R code.

Benchmarking

Use the dummy code provided below to benchmark reading the january_data files:

bench::mark(
  readr::read_csv("january_data.csv"),
  arrow::read_csv_arrow("january_data.csv"),
  vroom::vroom("january_data.csv", delim = ","),
  arrow::read_parquet("january_data.parquet"),
  check = FALSE # should NOT check
)

A Note on Parallel processing

CPU architecture

https://www.linkedin.com/pulse/understanding-physical-logical-cpus-akshay-deshpande/

Most jobs are done by a single core

https://i.redd.it/s008j9ibbfpx.jpg

Parallel processing

Parallel processing is a technique that utilizes multiple cores instead of a single core.

In R, the future package (among others) handles parallel processing.

Why Not All Cores by Default?

Imagine you’re preparing a simple sandwich. If you invite three friends to help, you each have to coordinate who does what—getting separate cutting boards, passing around ingredients, double-checking steps. The extra coordination might take longer than just making the sandwich yourself!


Overhead is the cost

In computing terms, splitting a job across multiple cores involves overhead:

  • Setting up processes
  • Transferring data
  • Merging results

If the task is small or simple, this overhead can outweigh the benefit of using more cores.
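A sketch of this overhead using the future package with future.apply (assumed installed alongside future); on such a tiny task, the parallel run is often slower than the sequential one:

library(future)
library(future.apply)

x <- 1:100

# Sequential: no coordination cost
plan(sequential)
t_seq <- system.time(future_lapply(x, sqrt))["elapsed"]

# Parallel: workers must be launched and data shipped to them
plan(multisession, workers = 2)
t_par <- system.time(future_lapply(x, sqrt))["elapsed"]

plan(sequential) # reset to the default plan
c(sequential = t_seq, parallel = t_par)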

Introduction to HPC

HPC at USF

CIRCE

  • Managed by Research Computing department

  • Two main clusters: CIRCE and Secure Cluster for sensitive data

  • Access through JIRA or email request

USF Research Computing Documentation

Getting Started: login

You can log in using a terminal. For example,

ssh your_username@circe.rc.usf.edu

After login

After login, you’ll be on the login node. Suppose you want to get compute resources from the server and an interactive prompt (bash).

Interactive session

srun --pty --nodes=1 --ntasks=1 --time=01:00:00 bash

  • Requests 1 node, 1 task

  • For 1 hour limit

  • Launches bash shell

Sending Jobs

Usually jobs on the server are run with batch scripts.

For instance, write a shell script file called my_job.sh such as:

#!/bin/bash
#SBATCH --job-name=hello-world
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=00:10:00
#SBATCH --mem=2G

module load R
srun Rscript hello.R
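The script above runs an R script named hello.R; a minimal placeholder might be:

# hello.R: a minimal script to test job submission
cat("Hello from", Sys.info()[["nodename"]], "\n")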

Then submit the job with:

sbatch my_job.sh

You can monitor your job with:

squeue -u $USER

Suggested Reading

  1. Apache Arrow in R (https://arrow-user2022.netlify.app/hello-arrow.html)
  2. Wickham, “R for Data Science”, 2nd ed.
    1. Ch. 23. Arrow